In this EDA (Exploratory Data Analysis) project I explore a dataset of financial contributions to the 2016 elections, examining its structure, variables, patterns and the relationships between those variables.
My goal with this project is to find interesting insights that could lead to further investigations.
I will start by exploring a few one-variable distributions, then compare them against the ‘amount’ variable and look for interesting patterns and relationships between those variables.
The data in this project was taken from the fec.gov website (the Federal Election Commission). It includes all individual campaign contributions for the 2016 presidential elections and contributions from authorized committees.
The original dataset had 19 columns, which I reduced to 14 in order to limit the scope of this project.
Important: the ‘finance’ dataset loaded here was ‘munged’ in a separate file called ‘all-munge.R’, which can be found in the root folder of this project on Github. You can read more about the structure of this project in the readme.MD file, which is located in the root folder as well.
The first variable to explore will be the ‘amount’ variable, which is the money a contributor donated to one or more of the candidates. It is the only vector in the downloaded dataset that is numeric rather than character.
First, let’s take a look at the dataset, get familiar with the variables and ask questions about the data.
## [1] 7299993 14
## [1] "cand_id" "candidate" "contributor" "city" "state"
## [6] "zipcode" "employer" "occupation" "amount" "date"
## [11] "tran_id" "election_tp" "party" "gender"
We can see above that the dataset now has 7,299,993 observations (one per contribution), with 14 columns describing each observation. The column names mean:
“cand_id” - Candidate ID
“candidate” - Candidate name
“contributor” - Contributor name
“city” - Contributor city
“state” - Contributor state
“zipcode” - Contributor zipcode
“employer” - Contributor employer
“occupation” - Contributor occupation
“amount” - Amount contributed
“date” - Contribution transaction date
“tran_id” - The contribution transaction ID
“election_tp” - Election type (General or Primaries)
“party” - The political party of the candidate
“gender” - The contributor’s gender
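The dimensions, column names and column types discussed above can be checked with a few base R calls. This is a minimal sketch on a toy stand-in for the ‘finance’ data frame; the rows are invented for illustration and only three of the 14 columns are shown.

```r
# Toy stand-in for the 'finance' data frame (values are invented).
finance <- data.frame(
  candidate = c("Clinton", "Sanders", "Trump"),
  state     = c("new york", "vermont", "florida"),
  amount    = c(25, 27, 100),
  stringsAsFactors = FALSE
)

dim(finance)             # number of rows, then number of columns
names(finance)           # column names
sapply(finance, class)   # storage class of each column

# Which columns are numeric? In the real data only 'amount' was expected to be.
numeric_cols <- names(finance)[sapply(finance, is.numeric)]
numeric_cols
```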
## cand_id candidate contributor
## P00003392:3486114 Clinton:3486114 TRUITT, ROBERTA : 1520
## P60007168:2021278 Sanders:2021278 BODNICK, KATIE : 1313
## P80001571: 742733 Trump : 742733 AMISIAL, WILFRID: 1078
## P60006111: 539457 Cruz : 539457 PURCELL, LARRY : 722
## P60005915: 244027 Carson : 244027 SMITH, DAVID : 686
## P60006723: 98528 Rubio : 98528 WILLIAMS, JAMES : 679
## (Other) : 167856 (Other): 167856 (Other) :7293995
## city state zipcode
## NEW YORK : 204203 california :1294446 Min. : 0
## LOS ANGELES : 102524 new york : 640831 1st Qu.:20815
## SAN FRANCISCO: 90577 texas : 539992 Median :52245
## WASHINGTON : 90229 florida : 420024 Mean :52643
## BROOKLYN : 87279 washington : 293342 3rd Qu.:88030
## SEATTLE : 83548 massachusetts: 279133 Max. :99999
## (Other) :6641633 (Other) :3832225 NA's :378
## employer occupation
## N/A : 988025 RETIRED :1630191
## RETIRED : 902186 NOT EMPLOYED : 618935
## SELF-EMPLOYED : 531510 INFORMATION REQUESTED: 238300
## NONE : 447954 ATTORNEY : 198752
## NOT EMPLOYED : 262109 TEACHER : 140600
## INFORMATION REQUESTED: 238602 PHYSICIAN : 111102
## (Other) :3929607 (Other) :4362113
## amount date tran_id
## Min. : 0 Min. :2013-10-01 A4EA7F7D9338943869B5: 8
## 1st Qu.: 15 1st Qu.:2016-03-02 AA2F3125A0DB141928EB: 8
## Median : 28 Median :2016-05-28 AAC874DDA3EA04584A39: 8
## Mean : 127 Mean :2016-05-19 AB37264C070244DDDBF7: 8
## 3rd Qu.: 93 3rd Qu.:2016-09-04 SA17A.4143 : 7
## Max. :4904861 Max. :2016-12-31 A1F4C793991D1416D939: 6
## (Other) :7299948
## election_tp party gender
## G2016:2593021 Democrat :5514536 female:3678175
## P2016:4706972 Green : 8926 male :3621818
## Independent: 1275
## Republican :1775256
##
##
##
Many different interesting points about the data can be seen in the above table and act as a ‘trailhead’ for investigative avenues. Let’s look at a few of them:
It seems that Hillary Clinton, under the ‘candidate’ column, had the highest number of occurrences, followed by Bernie Sanders and Donald Trump. Did she also lead with the total amount of contributions and not only the number of contributions?
Other things we can see in this first glance at the dataset, looking at the number of contributions, are:
New York is the leading city with 204,203 contributions; California is the leading state with the highest number of contributions (1,294,446); retired people take first place under the ‘occupation’ variable and second place under the ‘employer’ variable; the Democratic party had about 3 times more contributions than the Republican party (5,514,536 / 1,775,256); the amounts donated to all parties started from a few cents and reached $4,904,861, which was made by a single contributor. I wonder who that was.
I will focus on only a few of the questions and variables above in the scope of this project, and drill down where there is a need to better understand the distributions and the connections between the variables.
Let’s see first how much money was contributed in these elections by all contributors.
## [1] 928335424
The sum of all contributions to all candidates in the 2016 elections was $928,335,424.
The first (left) plot seems to be a non-descriptive one, but in fact we can learn a few basic things about the ‘amount’ distribution from it. First, following the amounts on the x axis, we can see the huge gap between the highest and lowest contributions. Second, we can see that most of the contributions were not very far from 0 and certainly not in the millions. Looking at the second plot above, which drops the top 3% of the contributions, we can see that most of the contributions were indeed below $200. The median contribution in this distribution is $28. I will split the donors into big and small on the $200 mark and check which candidates were supported by big and small contributors.
With fewer outliers and less variability, it is easier to look at the data and its distribution, which now looks like a normal distribution.
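Dropping the top 3% of contributions, as described above, can be sketched with `quantile()`. The amounts below are invented (apart from the maximum, which is the largest amount in the summary above), and the variable names are illustrative rather than taken from the project code.

```r
# Invented amounts with one extreme outlier, mimicking the shape of the data.
amount <- c(5, 10, 15, 25, 27, 28, 50, 93, 100, 250, 2700, 4904861)

# The 97th percentile is the cutoff; everything above it is dropped.
cutoff  <- unname(quantile(amount, probs = 0.97))
trimmed <- amount[amount <= cutoff]

range(amount)    # the full range, dominated by the outlier
range(trimmed)   # the range after trimming
```

On the real data a histogram of `trimmed` would correspond to the second plot; whether the project used exactly this cutoff rule is an assumption.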
We had 18 Republicans, 5 Democrats, 1 Independent and 1 Green, out of 25 candidates in the 2016 elections. Republicans outnumbered the Democrats more than 3 times, and the Green and Independent parties 18 times.
There are many questions that this party map of candidates brings up. First, why did the Republicans have so many more candidates than the other major party?
Another obvious point is that there were mainly two parties competing in these elections, while the small ones seemed to have a very slim chance of winning. This is not just because of their minimal representation by candidates, but also because of the relatively small amounts those parties collected compared to the two big ones, which will be demonstrated later on.
The American political system has been a two-party system since its inception, from the Federalist and Democratic-Republican parties until today’s Democratic and Republican parties. An interesting question for further investigation could be: what are the chances of a third party counting in the American political system, and can we learn this from the available data?
##
## 25 50 100 10 5 15 27 250 35
## 1054950 879091 775541 639945 437311 327867 312131 272740 149244
## 20
## 142332
The most frequent contributions were 25, 50, 100, 10, 5, 15, 27, 250, 35 and 20. Interestingly enough, 9 out of the 10 amounts are multiples of 5. The 7th amount in line is $27. This is the number that Bernie Sanders’ campaign advertised as their most popular contribution.
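A frequency ranking like the one above can be sketched with `table()` and `sort()`. The vector below is invented; on the real data the same calls would run on the ‘amount’ column.

```r
# Invented contribution amounts.
amount <- c(25, 25, 25, 25, 50, 50, 50, 100, 100, 27)

# Count each distinct amount and rank from most to least frequent.
freq <- sort(table(amount), decreasing = TRUE)
head(freq, 3)

# Check which of the ranked amounts are multiples of 5.
ranked_amounts   <- as.numeric(names(freq))
is_multiple_of_5 <- ranked_amounts %% 5 == 0
data.frame(amount = ranked_amounts, is_multiple_of_5)
```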
Looking at the different histograms, Hillary Clinton seems to lead in the number of contributions, followed by Sanders and Trump. It is not really clear from this plot who comes next in descending order. It could be Rubio, Cruz, Bush or Carson. I will dive into who really received the highest number of contributions and who received the highest amount of contributions.
The bar plot says it all. Clinton led these elections in the number of contributions, followed by Sanders, Trump, Cruz, Carson, Rubio, Paul, Bush, Fiorina and Kasich, in this order. So, exactly how many contributions did each of the top 10 candidates receive?
## # A tibble: 25 x 3
## candidate contributions percent
## <chr> <int> <dbl>
## 1 Clinton 3486114 47.8
## 2 Sanders 2021278 27.7
## 3 Trump 742733 10.2
## 4 Cruz 539457 7.39
## 5 Carson 244027 3.34
## 6 Rubio 98528 1.35
## 7 Paul 31170 0.43
## 8 Bush 27446 0.38
## 9 Fiorina 27410 0.38
## 10 Kasich 25166 0.34
## 11 Johnson 13184 0.18
## 12 Stein 8926 0.12
## 13 Walker 6519 0.09
## 14 Huckabee 6387 0.09
## 15 Christie 5782 0.08
## 16 O'Malley 5036 0.07
## 17 Graham 3712 0.05
## 18 Santorum 1676 0.02
## 19 Lessig 1326 0.02
## 20 McMullin 1275 0.02
## 21 Perry 896 0.01
## 22 Webb 782 0.01
## 23 Jindal 764 0.01
## 24 Pataki 323 0
## 25 Gilmore 76 0
Hillary Clinton received 48% of the total contributions, followed by Bernie Sanders with 28% and then Donald Trump with only 10% of the total contributions in both the primaries and the general elections. Hillary Clinton received about 4.7 times more contributions than Donald Trump, yet it did not help her win the race.
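The count and percentage per candidate can be sketched in base R as follows. The six rows are invented; the final ratio uses the Clinton and Trump counts from the table above.

```r
# Invented rows standing in for the 'finance' data frame.
finance <- data.frame(
  candidate = c("Clinton", "Clinton", "Clinton", "Sanders", "Sanders", "Trump"),
  stringsAsFactors = FALSE
)

# Count contributions per candidate and express them as a share of all rows.
counts <- table(finance$candidate)
by_candidate <- data.frame(
  candidate     = names(counts),
  contributions = as.integer(counts),
  percent       = round(100 * as.integer(counts) / nrow(finance), 1),
  stringsAsFactors = FALSE
)
by_candidate <- by_candidate[order(-by_candidate$contributions), ]
by_candidate

# Ratio of Clinton to Trump contributions, using the counts from the table.
clinton_vs_trump <- round(3486114 / 742733, 1)
clinton_vs_trump
```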
## # A tibble: 13 x 4
## occupation number sum percent
## <chr> <int> <dbl> <dbl>
## 1 RETIRED 1630191 162078962. 22.3
## 2 NOT EMPLOYED 618935 31058668. 8.5
## 3 INFORMATION REQUESTED 238300 37703600. 3.3
## 4 ATTORNEY 198752 51511381. 2.7
## 5 TEACHER 140600 7892231. 1.9
## 6 PHYSICIAN 111102 19181435. 1.5
## 7 HOMEMAKER 107877 29961875. 1.5
## 8 PROFESSOR 101698 10090862. 1.4
## 9 CONSULTANT 85940 16439027. 1.2
## 10 ENGINEER 75766 8257328. 1
## 11 SALES 62491 5806454. 0.9
## 12 LAWYER 56140 14769849. 0.8
## 13 MANAGER 54363 6839639. 0.7
The chart above cannot tell us much, since there are about 120,000 distinct occupations that donors entered on their contribution forms. The field was free text, with no restriction on the characters entered, so many occupations were written many times in different variations.
In order to analyze this facet of the dataset, we would have to write an algorithm that searches for similar terms and combines them.
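A first step toward combining similar occupation terms could look like the sketch below: normalize case, whitespace and punctuation, then map known variants onto one label. The input strings and the `synonyms` lookup are hypothetical; the project itself did not implement this step.

```r
# Invented free-text occupation entries.
occupation <- c("ATTORNEY", "Lawyer ", "attorney at law",
                "RETIRED", "Retired.", "TEACHER")

# Normalize: trim whitespace, upper-case, strip punctuation.
clean <- toupper(trimws(occupation))
clean <- gsub("[[:punct:]]", "", clean)

# Hypothetical synonym map: variant on the left, canonical label on the right.
synonyms <- c("LAWYER" = "ATTORNEY", "ATTORNEY AT LAW" = "ATTORNEY")
hit <- clean %in% names(synonyms)
clean[hit] <- unname(synonyms[clean[hit]])

table(clean)
```

A hand-written map only scales so far with roughly 120,000 distinct values; approximate string matching (for example base R’s `agrep()`) would be a natural next thing to try.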
Nevertheless, the share of retired donors in the above chart (22.3%) is pretty impressive, given that retirees were 14.5% of the population in 2016.
Also interesting here is the high number of donors who described themselves as ‘not employed’ at that time. I would think people without employment would not have money to donate, but they did, hundreds of thousands of times.
Women had a slight lead with the number of contributions.
## # A tibble: 2 x 2
## gender contributions
## <chr> <int>
## 1 female 3678175
## 2 male 3621818
Women contributed 3,678,175 times and men contributed 3,621,818 times. It is interesting to note that women not only contributed more, they also voted more than men in those elections. According to the Center for American Women and Politics, women have voted more than men in every election since 1964.
Source: Center for American Women and Politics
Why did women vote and contribute more than men? Maybe it is related to the fact that the US population in 2016 was 51% women and 49% men? That is a very interesting question for further research about women’s involvement in political issues, which, unfortunately, is out of the scope of this project.
## # A tibble: 25 x 4
## candidate contributions sum percent
## <chr> <int> <dbl> <dbl>
## 1 Clinton 3486114 480974942. 51.8
## 2 Trump 742733 121000442. 13.0
## 3 Sanders 2021278 92929614. 10.0
## 4 Cruz 539457 69170922. 7.45
## 5 Rubio 98528 39775178. 4.28
## 6 Bush 27446 32961134. 3.55
## 7 Carson 244027 28633656. 3.08
## 8 Kasich 25166 14656219. 1.58
## 9 Christie 5782 8033299. 0.87
## 10 Fiorina 27410 6680714. 0.72
## # ... with 15 more rows
As we can see, only 8 candidates out of the 25 had more than 1% of the sum of all contributions. Hillary Clinton received 52% of the contributions, followed by Donald Trump with 13% and Bernie Sanders with 10%.
## # A tibble: 1,299,884 x 4
## contributor count average sum
## <chr> <int> <dbl> <dbl>
## 1 HILLARY VICTORY FUND - UNITEMIZED 14 3090797. 43271164
## 2 SMITH, MICHAEL 544 177. 96286.
## 3 MILLER, MICHAEL 506 174. 88166.
## 4 BOCH, ERNIE 1 86937. 86937.
## 5 SMITH, JAMES 452 175. 79053.
## 6 SMITH, WILLIAM 598 123. 73695.
## 7 SMITH, DAVID 686 102. 69864.
## 8 BROWN, MICHAEL 362 188. 67997.
## 9 WILLIAMS, DAVID 376 178. 67053.
## 10 SMITH, ROBERT 542 121. 65797.
## # ... with 1,299,874 more rows
In the 2016 elections, rich donors could contribute as much as about $360,000 through Hillary Clinton’s campaign. Here is how it worked: donors who were rich, and willing, could give $5,400 to the Clinton campaign, $33,400 to the Democratic National Committee and $10,000 to each of the state parties (32 with Democratic committees), about $350,000 in all. A joint fundraising committee let the donor do it all with a single check.
On Jan. 1, the contribution limits reset for the party committees, and the Hillary Victory Fund could go back to its donors for another $350,000 in party funds.
While the maximum donation to a presidential campaign was $2,700 for the primary elections (plus another $2,700 for the general), the Hillary Victory Fund could accept much larger contributions because it was a so-called joint fundraising committee comprised of multiple committees.
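The per-donor ceiling described above can be reproduced as a small worked calculation. All figures come from the text; only the variable names are mine.

```r
# Limits quoted in the text above.
campaign_limit    <- 2 * 2700   # primary + general, to the Clinton campaign
dnc_limit         <- 33400      # Democratic National Committee
state_party_limit <- 10000      # per participating state party
n_state_parties   <- 32

# Maximum a single donor could give through the joint fundraising committee.
total <- campaign_limit + dnc_limit + n_state_parties * state_party_limit
total
```

The sum comes to $358,800, which is consistent with the "about $350,000" and "as much as $360,000" figures quoted above.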
So, the Hillary Victory Fund was a ‘fake’ contributor, and an extreme outlier, in our data. The lack of information about the real contributors must influence one or more of the analyses in this project. The HVF funneled large amounts of money to Hillary Clinton’s campaign, using the state committees as a legal channel to move money back and forth and reach the maximum amount per donor, leaving only 1% of the contributions with the state committees. As a result, the data we have, which is the government’s official 2016 contributions database, does not tell us who Clinton’s biggest donors were or how much each of them gave. Democratic donors, knowing the funds would end up with Clinton’s campaign, wrote six-figure checks to influence the election - 100 times larger than allowed. (from investor.com)
The actual big contributors, that were masked by the HVF, like Google, Facebook, JPMorgan Chase & Co, Stanford University, US Dept of State and others, can be found here.
As we can see above, on the
35,209 people contributed to more than 1 candidate, out of 1,307,046 recorded unique contributors, which is 2.7%. We can see that as the number of candidates goes up, the number of donors goes down, which seems logical. Who were the donors who contributed to the largest number of candidates?
## # A tibble: 6 x 4
## # Groups: contributor [6]
## contributor city candidates sum
## <chr> <chr> <int> <dbl>
## 1 WILSON, KIRK DALLAS 9 11730.
## 2 CALABRESI, STEVEN PROVIDENCE 8 24300
## 3 DRUMMOND, SARA MONTALBA 8 6700
## 4 AGRON, DOMINICK DINGMANS FERRY 7 4154.
## 5 FRIESS, FOSTER MR. JACKSON 7 18900
## 6 BRYANT, GORDON BEAUFORT 6 2025
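The number of distinct candidates per donor, which underlies the table above, can be sketched as follows. The rows are invented, and donors are identified by name only, which is an assumption that can merge different people who share a name.

```r
# Invented contributor/candidate pairs.
finance <- data.frame(
  contributor = c("A", "A", "B", "C", "C", "C"),
  candidate   = c("Bush", "Rubio", "Clinton", "Cruz", "Cruz", "Paul"),
  stringsAsFactors = FALSE
)

# Keep one row per contributor/candidate pair, then count candidates per donor.
pairs        <- unique(finance[, c("contributor", "candidate")])
n_candidates <- table(pairs$contributor)

# Donors who gave to more than one candidate, and their share of all donors.
multi <- names(n_candidates)[as.vector(n_candidates) > 1]
share <- round(100 * length(multi) / length(n_candidates), 1)
multi
share

# The share quoted in the text, recomputed from its own figures.
quoted_share <- round(100 * 35209 / 1307046, 1)
quoted_share
```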
Kirk Wilson, from Dallas, Texas (there were a couple of Kirk Wilsons in this database), donated to the largest number of candidates, 9 in all. Let’s see some more information about him and his contributions with a plot.
In 2015, Kirk Wilson contributed first to Fiorina and Huckabee and ended with Bush and Christie, giving to Bush 3 times. He then halted his contributions until the end of November, when he gave to Trump twice. I wonder why, as an obvious Republican supporter, he did not give to Trump throughout 2016?
I will look now into Hillary Clinton’s well-known claim that her campaign relied on small donations (less than $100). I went ahead, doubled the number and cut the data on the $200 mark (as other sources suggested), as the point that separates big and small donors.
##
## above $200 below $200
## 325147 135582
As we can see above, Clinton had almost 2.5 times more contributions above $200 than below it, contrary to her claim. I wonder what the ratio is for Trump and Sanders, who were her two main opponents in the two elections.
##
## above $200 below $200
## 110134 381298
Trump had almost 3.5 times more small donors than big donors!
##
## above $200 below $200
## 117120 102076
Sanders had almost the same number of small and big contributors. He had about 1.1 times more big donors than small ones.
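The big/small split used in the three tables above can be sketched like this. The `split_200` column that appears in a later table sits next to each donor’s total per candidate, so this sketch assumes the $200 mark is applied to that total rather than to single contributions; that reading, and counting exactly $200 as ‘above’, are both assumptions. The rows are invented.

```r
# Invented contributions.
finance <- data.frame(
  contributor = c("A", "A", "A", "B", "C", "C"),
  candidate   = c("Clinton", "Clinton", "Clinton", "Trump", "Sanders", "Sanders"),
  amount      = c(100, 100, 50, 27, 150, 25),
  stringsAsFactors = FALSE
)

# Total given by each donor to each candidate.
totals <- aggregate(amount ~ contributor + candidate, data = finance, FUN = sum)

# Label each donor/candidate total as above or below the $200 mark.
totals$split_200 <- ifelse(totals$amount >= 200, "above $200", "below $200")

# Cross-tabulate candidates against the two groups.
table(totals$candidate, totals$split_200)
```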
Let’s see the distribution of contributions above and below $200 for all candidates in a graph.
It seems that every candidate except Donald Trump received more money from ‘big donors’ than from small ones in the 2016 elections. Trump passed the rest of the candidates by far in small-donor contributions. Hillary, on the other hand, was the biggest recipient of big donations, while Sanders, Cruz and Carson received a more balanced ratio of contributions from small and big donors.
Working on the above data, I noticed that some people contributed more than once. Let’s see who they were.
## # A tibble: 6 x 6
## # Groups: contributor [6]
## contributor candidate count average sum split_200
## <chr> <chr> <int> <dbl> <dbl> <chr>
## 1 TRUITT, ROBERTA Clinton 1520 1 1520 above $200
## 2 BODNICK, KATIE Clinton 1313 4 5465. above $200
## 3 AMISIAL, WILFRID Clinton 1078 3 3526. above $200
## 4 PURCELL, LARRY Sanders 705 4 3138. above $200
## 5 SAUNDERS, ELIZABETH Clinton 675 6 4324. above $200
## 6 SCHWARTZ, HILARY Clinton 622 7 4429. above $200
Wow! Some people contributed hundreds of times. Roberta Truitt, who leads this list, donated 1,520 times with an average of $1, all to the Clinton campaign. There can be many reasons for that. It could be an automated system that makes online contributions on a person’s behalf, or an army of trolls who pump up the number of contributions for their candidate. An interesting question for me here is which candidate had the highest number of repeating contributors. I will define extreme-repeating contributors as those who donated more than 100 times.
## # A tibble: 8 x 3
## candidate sum_count average
## <chr> <int> <dbl>
## 1 Clinton 199062 149.
## 2 Sanders 76739 137.
## 3 Cruz 8453 143.
## 4 Trump 392 131.
## 5 Johnson 243 122.
## 6 Rubio 217 108.
## 7 Fiorina 107 107
## 8 Carson 104 104
Hillary Clinton was ahead of everyone else with almost 200K ‘extreme contributions’, followed by Sanders with about 77K. The number at the top of the bars is the average number of contributions per extreme donor.
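The ‘extreme contributor’ summary above can be sketched in base R: count contributions per donor and candidate, keep the donors with more than 100, then total and average per candidate. The data is invented, and the column name `n` is illustrative.

```r
# Invented data: A gives 150 times, B 120 times, C only 3 times.
finance <- data.frame(
  contributor = c(rep("A", 150), rep("B", 120), rep("C", 3)),
  candidate   = c(rep("Clinton", 150), rep("Sanders", 120), rep("Clinton", 3)),
  stringsAsFactors = FALSE
)

# Number of contributions per donor and candidate.
per_donor <- aggregate(n ~ contributor + candidate,
                       data = transform(finance, n = 1L), FUN = sum)

# Extreme-repeating contributors: more than 100 contributions.
extreme <- per_donor[per_donor$n > 100, ]

# Per candidate: total contributions from extreme donors and the average per donor.
sum_count <- tapply(extreme$n, extreme$candidate, sum)
average   <- tapply(extreme$n, extreme$candidate, mean)
sum_count
average
```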
Red and blue lines, respectively, are the Republican and Democratic primaries and the green line is the general election.
People started to donate already in 2014, but in very small numbers, as can be seen further down. Most of the donors started contributing in early 2015 and continued until November 2016. Some kept on giving even after the elections, but this died out after January 2017. We can see a steady build-up of the amount donated, leading to the highest amounts given in the months and days before the general election. There was a peak of contributions between February and June of 2016 and a drop right after. This might be related to the Republican and Democratic primaries, which took place between February and June of 2016.
Women contributed 1.2 times more than men to the Democratic party. On the other side of the aisle, Republican men contributed 1.7 times more than women. The Green party had an even wider gap between men’s and women’s number of contributions: men contributed twice as much as women to that party. The Independent party was the only one to have an almost identical number of contributions from men and women. We can also see here that Democrats received the highest number of contributions. Did they also receive the highest amount of contributions?
## # A tibble: 4 x 2
## party sum_contrib
## <chr> <dbl>
## 1 Democrat 578843055.
## 2 Republican 348029605.
## 3 Green 1119156.
## 4 Independent 343608.
Democrats received $579M, about 1.7 times as much as the Republicans. The Green party received $1.1M, which is more than 3 times as much as the Independent party.
Women donated to Clinton 1.5 times more than men, and men donated to Trump 1.7 times more than women. It seems that gender played a pretty dominant role in contributions to those two candidates.
Now, looking at the distribution of the donations, the voting pattern looks clearer. Donations were mostly given prior to an election. The assumption that the contributions peak we saw in the previous plot between February and June 2016 is related to the primaries, was correct.
## # A tibble: 8 x 3
## # Groups: party [4]
## party gender num_contrib
## <chr> <chr> <int>
## 1 Democrat female 3014728
## 2 Democrat male 2499808
## 3 Republican male 1115394
## 4 Republican female 659862
## 5 Green male 5966
## 6 Green female 2960
## 7 Independent male 650
## 8 Independent female 625
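The gender ratios quoted for each party can be recomputed directly from the counts in the table above. This is a small sketch; the matrix layout and names are mine, the numbers are from the table.

```r
# Number of contributions by party and gender, copied from the table above.
num_contrib <- matrix(
  c(3014728, 2499808,
     659862, 1115394,
       2960,    5966,
        625,     650),
  nrow = 4, byrow = TRUE,
  dimnames = list(party  = c("Democrat", "Republican", "Green", "Independent"),
                  gender = c("female", "male"))
)

# Male-to-female ratio per party; values below 1 mean women contributed more often.
male_to_female <- round(num_contrib[, "male"] / num_contrib[, "female"], 2)
male_to_female

# For the Democratic party the female-to-male ratio is easier to read.
dem_female_to_male <- round(num_contrib["Democrat", "female"] /
                            num_contrib["Democrat", "male"], 2)
dem_female_to_male
```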
Looking at the above faceted data, the trend we saw earlier, with contributions growing over time and closer to the general elections, is missing from the Republican party. There actually seems to be a downward trend towards the general elections on the Republican side.
3 out of the 4 candidates who received early donations were Republicans: Cruz, Paul and Rubio. Rubio was the only one who received contributions in 2013 and most of 2014. Did starting early help Rubio? Let’s see how much money each candidate collected along the way to the elections.
(active chart)
Leading the state contributions are California with $160M, New York with $130M, Texas with $85M and Florida with $62M. As for cities, Palm Beach pops up first, judging by the size of the red dot and the dark color. Did the amount contributed from each state reflect the size of its population? Let’s take a look at it with 2 charts. In a further investigation here I would add a few variables, like gender and party, and try to find hints of relationships between all the variables.
We can see that there is a very strong correlation (0.935) between the number of contributions per state and the population of that state. In a further investigation I would analyze the correlation between cities and their financial contributions to the different parties.
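A correlation like the one reported above is computed with `cor()`. The sketch below uses invented state names and figures, so its result is not the 0.935 from the real data; it only shows the calculation.

```r
# Invented per-state figures (not real populations or contribution counts).
states <- data.frame(
  state         = c("state_a", "state_b", "state_c", "state_d", "state_e"),
  population    = c(1, 2, 3, 4, 5) * 1e6,
  contributions = c(10, 22, 29, 41, 52) * 1e3,
  stringsAsFactors = FALSE
)

# Pearson correlation between population and number of contributions.
r <- cor(states$population, states$contributions)
r
```

Pearson’s coefficient is the `cor()` default; whether the project used Pearson or another method is an assumption.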
Exploring the 2016 elections’ finance dataset and describing my findings in numbers, plots and maps was a great challenge! This was a great opportunity to practice plotting with built-in R functions and useful external packages, like plyr, dplyr, ggplot2, data.table and others.
The work on this dataset also taught me a lot of things I was not aware of, despite them being publicly available and me being an avid follower of politics.
Data can be misleading if it is not connected to the real-life events that produced the topic being investigated. For example, the Clinton campaign had many contributions from many contributors coming in, represented by only one name (Hillary Victory Fund). The ‘contributor’ HVF was an outlier that skewed the data by masking all those contributors. Nevertheless, removing this ‘contributor’ from the dataset would also have removed the sum of the donations that this line in the dataset encompasses, which in this case was $43M, not a negligible amount, so I left it as it is and hacked my way around it.
I started the project with a dataset of financial contributions from California, the state I live in, but found it lacking data that is available on the national level, knowing that I could always go back and drill down into the state’s data. It seemed like more of a challenge to work with the national dataset, and it indeed was exactly that.
Working with more than 7 million rows on a laptop was very time consuming at the beginning, but I found better ways and tools for a given task to overcome the resource obstacle. For example, I experienced issues with dplyr and knitr, so I moved to sqldf, which worked. Still slow, but it worked. I worked with built-in R functions, as with the “Percent of contributions per gender” block above, but I ended up writing most of the code with dplyr, for its straightforward approach. In order to improve the workflow, I also created a sample file, which I used to run time-consuming code chunks.
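Creating a smaller sample for time-consuming chunks, as mentioned above, can be sketched with `sample()`. The data frame, the sample size and the seed are invented for illustration; the project’s actual sample file may have been built differently.

```r
set.seed(2016)  # fixed seed so the same sample is drawn on every run

# Invented stand-in for the full data frame.
finance <- data.frame(
  tran_id = seq_len(1000),
  amount  = round(runif(1000, min = 1, max = 500), 2)
)

# Draw 100 rows at random, without replacement.
idx            <- sample(nrow(finance), size = 100)
finance_sample <- finance[idx, ]

nrow(finance_sample)
```

Slow chunks can then be developed against `finance_sample` and rerun on the full data once they work.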
Naturally, the challenging part and the part that took the longest time to accomplish was the data wrangling. Here is some of the ‘heavy lifting’ I did with the ‘all-munge.R’ file:
I changed a few of the variable names; shortened the candidates’ names to their last name only; restricted the data to the primaries and general elections of 2016; removed all donated amounts that were negative (-); added a column for the candidate’s party affiliation; added a column with the gender of the contributor, based on a pre-defined database that I downloaded to my computer; and added a new column with the day and the year, extracted from the contributions’ date column. All in all, this produced a better-organized dataset to work with when doing the EDA.
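A few of the wrangling steps listed above can be sketched in base R. This is not the content of ‘all-munge.R’: the raw name format ("Last, First"), the column names and the `party_lookup` vector are assumptions made for illustration.

```r
# Invented raw rows; the "Last, First" name format is an assumption.
raw <- data.frame(
  candidate = c("Clinton, Hillary Rodham", "Trump, Donald J.", "Sanders, Bernard"),
  amount    = c(25, -50, 27),
  date      = as.Date(c("2016-03-02", "2016-09-04", "2015-11-15")),
  stringsAsFactors = FALSE
)

# Shorten candidate names to the last name (everything before the first comma).
raw$candidate <- sub(",.*$", "", raw$candidate)

# Remove negative amounts.
clean <- raw[raw$amount >= 0, ]

# Extract the year from the contribution date.
clean$year <- as.integer(format(clean$date, "%Y"))

# Add party affiliation from a hypothetical lookup vector.
party_lookup <- c(Clinton = "Democrat", Sanders = "Democrat", Trump = "Republican")
clean$party  <- unname(party_lookup[clean$candidate])

clean
```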
Removing and adding new variables - I found through this project that, on one hand, you want to minimize the number of columns for the sake of speed, and on the other hand, those same variables can turn out to be meaningful further down the analysis. I had to go back and rework the program, adding the old variables back to the dataset.
Another challenge was choosing which libraries to work with for the maps. In order to map the distributions of variables, I had to learn some new packages, like leaflet, tmap and ggmap.
A very interesting topic to explore in further analysis would be donations vs. votes in the 2016 elections, and projecting this onto the 2020 elections.